if (!requireNamespace("tidyverse", quietly = TRUE))
install.packages('tidyverse')
if (!requireNamespace("Seurat", quietly = TRUE))
install.packages('Seurat')
if (!requireNamespace("colorBlindness", quietly = TRUE))
install.packages('colorBlindness')
if (!requireNamespace("RColorBrewer", quietly = TRUE))
install.packages('RColorBrewer')
if (!requireNamespace("DoubletFinder", quietly = TRUE))
remotes::install_github('chris-mcginnis-ucsf/DoubletFinder')
if (!requireNamespace("ggridges", quietly = TRUE)) {
install.packages("ggridges")
}5 - Quality Control
Quality Control
Introduction
This notebook reviews the quality control steps recommended at the beginning and at each subsequent annotation step when analyzing scRNA-seq. To clean up the cell by gene matrix, we aim to remove technical artifacts such as doublets or empty droplets. In this notebook, we will look at UMI and gene counts, filter out cells with high mitochondrial levels, investigate potential doublets (using DoubletFinder), and compare the cell by gene distribution before and after filtering.
Key Takeaways
QC is the process of filtering out “bad” data. This makes it a subjective process, making well-informed decisions is key to avoid removing biological signal. In this context, being conservative in what we call “poor quality” cells is best practice.
QC metrics include UMI and gene counts, mitochondrial gene expression, and doublet detection.
Looking at QC metric individually is not enough to grasp the whole picture. We need to look at how these metrics covary to make informed decisions.
The main pitfall in this step is to be too stringent. This may lead to removing true biological signal. Some examples of cells that might be filtered out during the QC process are neutrophils, platelets, and erythrocytes since they have low library size and complexity.
Doublets can be confounded with transitioning cells or cells with mixed phenotypes. Therefore, it is very important that we are very careful when calling a cell a doublet! Otherwise, we might be removing very interesting biology.
When in doubt, keep it. Once you remove something, you’ll never get it back.
Why are mitochondrial genes highly expressed and found in many datasets?
There is a disproportionate number of mitochondrial-high cells because cells upregulate mitochondrial genes during stress (such as during library prep) and near death.
From Orchestrating single-cell analysis with Bioconductor:
“High proportions (of mitochondria) are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), presumably because of loss of cytoplasmic RNA from perforated cells. The reasoning is that, in the presence of modest damage, the holes in the cell membrane permit efflux of individual transcript molecules but are too small to allow mitochondria to escape, leading to a relative enrichment of mitochondrial transcripts.”
Glossary
Doublet:
A doublet is a technical event when two or more cells end up within the same droplet. These cause confusion in the gene expression distribution especially when cells of two different cell types (different expected gene expression profiles) are in the same droplet.
Heterotypic doublets vs Homotypic doublets
A heterotypic doublet is a doublet with cells from distinct gene expression profiles that are likely different cell types.
A homotypic doublet is a doublet with cells with similar gene expression profiles that are not necessarily different cell types but are suspicious because they express around double the amount that other cells do.
Resources
Current best practices in single-cell RNA-seq analysis: a tutorial
Bioconductor: Orchestrating single-cell analysis with Bioconductor
-
A software that can access most popular demulitplexing and doublet detection tools. I used it as a review and consolidation of resources.
Loading Libraries and Data
library(Seurat)
library(tidyverse)
library(RColorBrewer)
library(colorBlindness)
library(DoubletFinder)
library(ggridges)
set.seed(687)# Load Seurat object
se <- readRDS('../data/Covid_Flu_Seurat_Object.rds')
seAn object of class Seurat
33234 features across 59572 samples within 1 assay
Active assay: RNA (33234 features, 0 variable features)
1 layer present: counts
# Set color palette for cell types
pal <- paletteMartin
names(pal) <- sort(unique(se$Celltype))
donor_pal <- c(
"#66C2A4", "#41AE76", "#238B45", "#006D2C",
"#41B6C4", "#1D91C0", "#225EA8", "#253494", "#081D58",
"#FFEDA0", "#FED976", "#FEB24C", "#f79736", "#FD8D3C",
"#FC4E2A", "#E31A1C", "#BD0026", "#ad393b", "#800000", "#800050")
names(donor_pal) <- c(
"Normal 1", "Normal 2", "Normal 3", "Normal 4",
"Flu 1", "Flu 2", "Flu 3", "Flu 4", "Flu 5",
"nCoV 1", "nCoV 2", "nCoV 3", "nCoV 4", "nCoV 5",
"nCoV 6", "nCoV 7", "nCoV 8", "nCoV 9", "nCoV 10", "nCoV 11"
)Filter out the overexpressed genes
Let’s look at the number of counts per barcode, the number of genes per barcode, and the percentage of mitochondrial genes per barcode. These distributions will help us determine the quality of the data by allowing us to see any outliers. These outliers could be dying cells with high mitochondrial counts or doublets with high gene counts.
In public datasets, use colnames() to find the processing and metadata already analyzed.
# Correct calculation of the number of metadata variables
vars <- length(colnames(se@meta.data))
# Use paste0 for concatenation without spaces
print(paste0("There are ", vars, " metadata variables in the seurat object."))[1] "There are 49 metadata variables in the seurat object."
# Sample the first 10 metadata variables
colnames(se@meta.data)[1:10] [1] "orig.ident" "nCount_RNA" "nFeature_RNA" "Sample ID" "Disease group" "Comorbidity" "Hospital day" "WBC per microL" "Neutrophil per microL (%)" "Lymphocyte per microL (%)"
Library Size
nCount_RNA is the number of UMIs per cell. This is a measure of the total number of transcripts detected in a cell and is the way to visualize library size. Cells with low nCount_RNA values may be dying cells or biologically relevant cell types with low mRNA content such as neutrophils. Cells with high nCount_RNA values may be doublets or very large cells like macrophages. Note the figure below plots UMIs on a log10 scale.
ggplot(se@meta.data, aes(x = nCount_RNA)) +
geom_density(color = "#6abcb6", fill = "#6abcb6", alpha = 0.7) +
scale_x_continuous(
transform = "log10",
labels = scales::unit_format(unit = "K", scale = 1e-3)) +
theme_classic() +
scale_color_manual(values = pal)Plotting the UMIs by cell type shows that if we were to make a hard cut off at 500 UMIs, we would lose a large portion of platelets and RBCs.
ggplot(se@meta.data, aes(x = nCount_RNA, fill = Celltype)) +
geom_density(alpha = 0.7) +
# Hard cutoff at 500 elicits loss in important cells
geom_vline(xintercept = 500, linetype = "dashed", color = "red") +
scale_x_continuous(
transform = "log10",
labels = scales::unit_format(unit = "K", scale = 1e-3)) +
theme_classic() +
scale_color_manual(values = pal)Library Complexity
nFeature_RNA is the number of genes detected in a cell. Similarly, we want to look for cells with significantly low or high feature count. The feature count distribution gives us an idea of library complexity while UMI counts looks at library size.
ggplot(se@meta.data, aes(x = nFeature_RNA)) +
geom_density(color = "#6abcb6", fill = "#6abcb6", alpha = 0.7) +
scale_x_continuous(
labels = scales::unit_format(unit = "K", scale = 1e-3)) +
theme_classic() +
scale_color_manual(values = pal)A cell with a very high high number of genes per cell could be indicative of doublets or large cells. Cells on the left side of the distribution with low gene counts could be low quality cells or just RBCs or platelets.
Percentage of mitochondrial genes
Notice the object already has Percent of mitochondrial genes calculated. We are going to calculate them again to show the process.
Seurat’s PercentageFeatureSet function calculates the percentage of a feature set in each cell. This function is useful for calculating the percentage of mitochondrial genes, ribosomal genes, or any other gene set of interest using regular expression (string matching) to filter these out by gene name.
# Calculate the percentage of mitochondrial genes
se <- PercentageFeatureSet(se, pattern = "^MT-", col.name = "perc.mt")
# Calculate the percentage of ribosomal genes
se <- PercentageFeatureSet(se, pattern = "^RPS|^RPL", col.name = "perc.ribo")
# Calculate the percentage of hemoglobin genes
se <- PercentageFeatureSet(se, pattern = "^HB[^(P)]", col.name = "perc.hb")Check the mitochondrial percentage across samples
# Plot Percent of mitochondrial genes by Sample
VlnPlot(se,
group.by = "Sample ID",
features = "perc.mt",
# log = TRUE,
pt.size = 0,
) +
NoLegend() +
scale_fill_manual(values = donor_pal)# Plot Percent of ribosomal genes by Sample
VlnPlot(se,
group.by = "Sample ID",
features = "perc.ribo",
# log = TRUE,
pt.size = 0
) +
NoLegend() +
scale_fill_manual(values = donor_pal)# Plot Percent of hemoglobin genes by Sample
VlnPlot(se,
group.by = "Sample ID",
features = "perc.hb",
# log = TRUE,
pt.size = 0
) +
NoLegend() +
scale_color_manual(values = scale_fill_manual(values = donor_pal))# Plot features by Sample together
VlnPlot(se,
group.by = "Sample ID",
features = c("perc.mt", "perc.ribo", "perc.hb"),
# log = TRUE,
pt.size = 0
) &
NoLegend() &
scale_fill_manual(values = donor_pal)With these distributions, something notable is going on in Flu 5. Let’s perform other quality control visualizations to see if we can find out what is happening.
# Comparison of mitochondrial and hemoglobin genes
p1 <- se@meta.data %>%
ggplot(., aes(x = perc.mt, y = perc.hb, color = Celltype)) +
geom_point() +
scale_color_manual(values = pal) +
theme_classic()
p2 <- se@meta.data %>%
ggplot(., aes(x = perc.mt, y = perc.hb, color = `Sample ID`)) +
geom_point() +
scale_color_manual(values = donor_pal) +
theme_classic()
p1 | p2Plotting the hemoglobin genes by mitochondrial genes shows that the Flu 5 has low mitochondria and high hemoglobin suggesting its sample contained a lot of blood and is not poor quality.
Feature Covariation
Looking at how QC features covary between them is very important. It gives us a better sense of what is going on and allows us to make better decisions.
This covariation plot colors by a specified variable in order to see its prevalence across the data. Below, the distribution of mitochondrial percentage across the dataset is shown in relation to the library size and complexity.
# Percent Mitochondrial Genes
se@meta.data %>%
ggplot(., aes(x = nCount_RNA, y = nFeature_RNA, color = perc.mt)) +
geom_point() +
theme_classic() +
scale_color_gradient(low = "yellow", high = "red")# Percent MT Genes per sample
se@meta.data %>%
ggplot(., aes(x = nCount_RNA, y = nFeature_RNA, color = perc.mt)) +
geom_point() +
facet_wrap(~`Sample ID`) +
theme_classic() +
scale_color_gradient(low = "yellow", high = "red") +
scale_x_continuous(
labels = scales::unit_format(unit = "K", scale = 1e-3))It is not strange that we lack cells with high mitochondrial percentage since the authors mention they filter out cells with >15% mitochondrial genes. This explains the cap we see at that range above.
Below we look at the distribution of ribosomal expression. Ribosomal percentage can be helpful when studying tumors as tumoral cells tend to have higher ribosomal percentage.
# Percent Ribosomal Genes
se@meta.data %>%
ggplot(., aes(x = nCount_RNA, y = nFeature_RNA, color = perc.ribo)) +
geom_point() +
theme_classic() +
scale_color_gradient(low = "yellow", high = "red")# Percent MT Genes per sample
se@meta.data %>%
ggplot(., aes(x = nCount_RNA, y = nFeature_RNA, color = perc.ribo)) +
geom_point() +
facet_wrap(~`Sample ID`) +
theme_classic() +
scale_color_gradient(low = "yellow", high = "red") +
scale_x_continuous(
labels = scales::unit_format(unit = "K", scale = 1e-3))Lastly, let’s plot the hemoglobin % distribution and investigate the likely bloodier samples.
# Percent Hemoglobin Genes
se@meta.data %>%
ggplot(., aes(x = nCount_RNA, y = nFeature_RNA, color = perc.hb)) +
geom_point() +
theme_classic() +
scale_color_gradient(low = "yellow", high = "red")# Percent hemoglobin Genes per sample
se@meta.data %>%
ggplot(., aes(x = nCount_RNA, y = nFeature_RNA, color = perc.hb)) +
geom_point(alpha = 0.5) +
facet_wrap(~`Sample ID`) +
theme_classic() +
scale_color_gradient(low = "yellow", high = "red") +
scale_x_continuous(
labels = scales::unit_format(unit = "K", scale = 1e-3))The odd distribution of Flu 5 makes more sense seeing its high hemoglobin content with lower cell counts.
Since we have the annotations for this dataset, let’s plot them to see how they support this.
# Plot by cell type
se@meta.data %>%
ggplot(., aes(x = nCount_RNA, y = nFeature_RNA, color = Celltype)) +
geom_point() +
theme_classic() +
scale_color_manual(values = pal)# Percent MT Genes per sample
se@meta.data %>%
ggplot(., aes(x = nCount_RNA, y = nFeature_RNA, color = Celltype)) +
geom_point() +
facet_wrap(~Celltype) +
theme_classic() +
scale_color_manual(values = pal)We can clearly see how RBCs sit in the bottom right blob showing high percentage of hemoglobin genes.
Add Batch Information
Preprocessed Seurat object’s will often contain batch and sample information. From the paper for this dataset, the batch information is found in a supplementary table and the authors claimed most doublets were classified as “Uncategorized”. Because they did not include the doublet information in the metadata, we will add it in this section.
Packages to find doublets in both R and Python run one batch at a time as doublets are only introduced as a technical error at the 10X GEM run level during the encapsulation of cells within oil droplets. We imported the batch information so we could subset the larger object and filter out doublets per batch. Doublet detection algorithms must be run by 10X GEM run!
# Make a data frame and extract barcode and Sample ID metadata from se obj
info <- data.frame(Barcode = colnames(se), "Sample ID" = se@meta.data$'Sample ID')
colnames(info) <- gsub("\\.", " ", colnames(info))
# Load in batch information
batch_info <- read.csv("../data/batch_info.csv")
# Rename batch_info column names to have a space instead of "."
colnames(batch_info) <- gsub("\\.", " ", colnames(batch_info))
batch_info <- batch_info %>%
select("Sample ID", "Experimental batch")
head(batch_info) Sample ID Experimental batch
1 Flu 1 1
2 Flu 2 1
3 Flu 3 2
4 Flu 4 2
5 Flu 5 2
6 nCoV 1 1
# merge the batch_info with the info dataframe
info <- left_join(info, batch_info, by = "Sample ID") %>%
column_to_rownames("Barcode")
# Add batch numbers to the metadata
se <- AddMetaData(se, info)DoubletFinder
Doublet detection and removal is sensitive because you don’t want to remove any valuable outliers but still take into consideration the noisiness of the data. There are many different packages that can be used to detect doublets. If you are familiar with Python, we recommend Scrublet. But for the sake of staying in one programming language, DoubletFinder was found to be one of the most accurate R packages.
DoubletFinder uses a nearest neighbor approach to identify doublets. The package has a function called doubletFinder that takes in a Seurat object and returns a Seurat object with doublet information added in the metadata.
How it Works
- DoubletFinder combines random cell’s gene expression profiles to create “fake doublets.”
- Artificial doublets are inserted into the data
- Dimensionality reduction and a KNN graph is built with our cells and the fake doublets.
- On the k-nearest neighbor graph each cell has k most-similar neighbors. pANN is the proportion of each cell’s neighbors that are “fake doublets”.
- A high pANN value indicates that a larger proportion of the cells around a specific cell in the kNN graph are “fake doublets” suggesting that the cell has a transcriptome similar to a doublet.
Cells in transitioning states or cells with mixed phenotpyes can be mistaken for doublets. It is important not to disregard cells just because they have a large doublet score (pANN value), there might be interesting biological information in these cells!
DoubletFinder allows you to set a threshold and will give each cell a categorical result of ‘Singlet’ or ‘Doublet’. We suggest looking at the raw pANN values to understand the distribution of these cells and make threshold adjustments accordingly.
Parameters
pK parameter: PC neighborhood size that is used to compute pANN, and it is expressed as a proportion of the merged real-artificial data. This metric determines the accuracy of the classification model. The higher the pK, the more stringent the model is in classifying doublets. The lower the pK, the more lenient the model is in classifying doublets.
pN parameter: default number of doublets we expect to find (default = 0.25). pN functions as the threshold.
pANN: proportion of artificial nearest neighbors or the “doublet score”.
Note: DoubletFinder is only sensitive to heterotypic doublets (transcriptionally-distinct doublets) so the developer suggests using a cell-type annotations (here: batch_seurat_object@meta.data$seurat_clusters) to model the expected proportion of homotypic (transcriptionally-similar) doublets.
library(DoubletFinder)
exp_btch <- unique(se@meta.data$`Experimental batch`)# Run doubletFinder
process_batch <- function(batch) {
print(batch)
# 1 - Subset Seurat object by batch
batch_seurat_object <- subset(se, cells = which(se@meta.data$'Experimental batch' == batch))
# 2 - Process count data of that subseted Seurat object
batch_seurat_object <- batch_seurat_object %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA()
# 3 - nExp defines the expected pANN threshold used to make final doublet-decisions
annotations <- batch_seurat_object$Celltype
homotypic.prop <- modelHomotypic(annotations)
nExp_poi <- round(0.075 * nrow(se@meta.data)) # Assuming 7.5% doublet formation rate
nExp_poi.adj <- round(nExp_poi * (1 - homotypic.prop))
# 4 - Run doubletFinder
batch_seurat_object <- doubletFinder(
batch_seurat_object,
PCs = 1:20,
pN = 0.1,
pK = 50 / ncol(batch_seurat_object), # set a k to aproximately 50 cells per neighborhood for pANN
nExp = nExp_poi.adj,
reuse.pANN = FALSE, # If TRUE, pANN will be reused from a previous run
sct = FALSE) # If TRUE, sctransform will be used for preprocessing
# 7 - Format doublet information for return dataframe
pANN <- colnames(batch_seurat_object@meta.data) %>%
keep(str_detect(colnames(batch_seurat_object@meta.data), '^pANN*')) # pANN_0.25_0.3_1000
DF_class <- colnames(batch_seurat_object@meta.data) %>%
keep(str_detect(colnames(batch_seurat_object@meta.data), '^DF.classifications*')) # DF.classifications_0.25_0.3_1000
## Extract values from pANN variable name
params <- gsub("^pANN_", "", pANN) # Remove "pANN_"
params <- strsplit(params, "_")[[1]] # Split by "_"
pN <- params[1] # Extract "0.1"
pK <- params[2] # Extract "0.1"
doublet_run <- params[3] # Extract "1000"
## Create new columns "pN", "pK", and "doublet_run"
colnames(batch_seurat_object@meta.data)[colnames(batch_seurat_object@meta.data) == pANN] <- "pANN"
colnames(batch_seurat_object@meta.data)[colnames(batch_seurat_object@meta.data) == DF_class] <- "DF_class"
batch_seurat_object@meta.data$pN <- pN
batch_seurat_object@meta.data$pK <- pK
batch_seurat_object@meta.data$doublet_run <- doublet_run
# 8 - Return DF of doublet information here
df <- data.frame(
Barcode = colnames(batch_seurat_object),
batch = batch_seurat_object@meta.data$'Experimental batch',
pANN = batch_seurat_object@meta.data$'pANN',
DF_class = batch_seurat_object@meta.data$'DF_class',
doublet_run = batch_seurat_object@meta.data$doublet_run,
pK = batch_seurat_object@meta.data$pK,
pN = batch_seurat_object@meta.data$pN
)
rownames(df) <- df$Barcode
return(df)
}
processed_doublets <- lapply(exp_btch, process_batch)[1] 3
[1] "Creating 2084 artificial doublets..."
[1] "Creating Seurat object..."
[1] "Normalizing Seurat object..."
[1] "Finding variable genes..."
[1] "Scaling data..."
[1] "Running PCA..."
[1] "Calculating PC distance matrix..."
[1] "Computing pANN..."
[1] "Classifying doublets.."
[1] 4
[1] "Creating 1834 artificial doublets..."
[1] "Creating Seurat object..."
[1] "Normalizing Seurat object..."
[1] "Finding variable genes..."
[1] "Scaling data..."
[1] "Running PCA..."
[1] "Calculating PC distance matrix..."
[1] "Computing pANN..."
[1] "Classifying doublets.."
[1] 2
[1] "Creating 958 artificial doublets..."
[1] "Creating Seurat object..."
[1] "Normalizing Seurat object..."
[1] "Finding variable genes..."
[1] "Scaling data..."
[1] "Running PCA..."
[1] "Calculating PC distance matrix..."
[1] "Computing pANN..."
[1] "Classifying doublets.."
[1] 1
[1] "Creating 1743 artificial doublets..."
[1] "Creating Seurat object..."
[1] "Normalizing Seurat object..."
[1] "Finding variable genes..."
[1] "Scaling data..."
[1] "Running PCA..."
[1] "Calculating PC distance matrix..."
[1] "Computing pANN..."
[1] "Classifying doublets.."
# Add doublet information to the main seurat object
dbl_info <- bind_rows(processed_doublets)dbl_info <- dbl_info %>% select(-Barcode)
se <- AddMetaData(
object = se,
metadata = dbl_info
)# Save Doublet information in Seurat object
saveRDS(se, '../data/Covid_Flu_Seurat_Object_DF.rds')Load Pre-Run DF object
se <- readRDS('../data/Covid_Flu_Seurat_Object_DF.rds')Perform Preprocessing
se <- se %>%
NormalizeData(verbose = FALSE) %>%
FindVariableFeatures(
method = "vst",
nfeatures = 3000,
verbose = FALSE) %>%
ScaleData(verbose = FALSE, features = VariableFeatures(.)) %>%
RunPCA(verbose = FALSE) %>%
RunUMAP(dims = 1:30, verbose = FALSE)# Plot UMAP colored by batch
DimPlot(se, reduction = "umap", group.by = 'DF_class') +
labs(title = "UMAP Plot Colored by Singlet/Doublet classification")Let’s look at the distribution of pANN values
# Plot UMAP by pANN
p1 <- FeaturePlot(se, reduction = "umap", features = "pANN") +
labs(title = "UMAP Plot Colored by pANN")
# Plot UMAP by cell type
p2 <- DimPlot(se, reduction = "umap", group.by = "Celltype") +
labs(title = "UMAP Plot Colored by Cell Type") +
scale_color_manual(values = pal)
p1 | p2# DimPlot(se, reduction = "umap", group.by = "Celltype") +
# labs(title = "UMAP Plot Colored by Batch")Let’s see if celltypes annotated by the authors are show the pANN distribution.
# Assuming se@meta.data is your data frame
ggplot(se@meta.data, aes(x = pANN, y = Celltype, fill = Celltype)) +
geom_density_ridges(alpha = 0.5) +
theme_classic() +
scale_color_manual(values = pal)We can see how Uncategorized 2 has a high pANN but also intermediate Monocytes. As mentioned in the paper the doublets in this dataset are annotated as Uncategorized which makes sense. However, those cells labeled as intermediate monocyte could be a case where we have a cell with a mixed phenotype of two other cell types in the dataset (classical and nonclassical monocytes). This leads to that population having a high pANN score without them being a doublet.
Filter out cells
Label cells as low quality according to three metrics:
Any cell with less that 500 UMIs
Any cell with less than 200 genes
Any cell with more than 15% mitochondrial genes
se@meta.data <- se@meta.data %>%
mutate(
quality = if_else(
nFeature_RNA > 200 & nCount_RNA > 500 & perc.mt < 15, "good quality", "bad quality"),
quality = factor(quality, levels = c("bad quality", "good quality"))
)s2 <- FeaturePlot(se, reduction = "umap", features = "pANN") +
labs(title = "UMAP Plot Colored by pANN") +
scale_fill_gradient(limits = c(0, 1))
# Plot UMAP of good quality cells
s3 <- DimPlot(se, reduction = "umap", group.by = "quality") +
labs(title = "UMAP Plot Colored by Cell Quality")
s4 <- DimPlot(se, reduction = "umap", group.by = "Celltype") +
labs(title = "UMAP Plot Colored by Cell Type") +
scale_color_manual(values = pal)
s2 | s3 | s4Lack of overlap between pANN dist and doublet dist is expected since pANN identifies potential doublets (expected to have higher library size and complexity) while the cell quality filter aims to identify the opposite pattern.
We will keep the potential doublets in the data because they may be biologically relevant.
Differential Gene Expression comparison between good and bad quality cells
Lastly, we carry out a differential gene expression analysis between poor and good quality cells. This is aimed at making sure that the cells labelled as “bad quality” don’t make up a specific cell type that happens to have small library size and complexity. Ideally, when removing these cells we expect to see that the genes differentially expressed in “bad quality” are just mitochndrial ones.
# Using the non-batched 'discard' vector for demonstration purposes,
# as it has more cells for stable calculation of 'lost'.
discard <- se$quality == "bad quality"
lost <- sparseMatrixStats::rowMeans2(se@assays$RNA$counts[, discard])
kept <- sparseMatrixStats::rowMeans2(se@assays$RNA$counts[, !discard])
library(edgeR)
logged <- cpm(cbind(lost, kept), log = TRUE, prior.count = 2)
logFC <- logged[,1] - logged[,2]
abundance <- rowMeans(logged)
df <- data.frame(logFC, abundance, gene = row.names(se)) %>%
arrange(desc(logFC)) %>%
mutate(rank = row_number()) %>%
mutate(gene_txt = if_else(rank <= 30 | rank > n() - 30, gene, NA_character_))
# Assuming you have a data frame with logFC and abundance
df %>%
mutate(
gene_type = case_when(
str_detect(gene, pattern = "^MT-") ~ "mt",
str_detect(gene, pattern = "^RPS|^RPL") ~ "rb",
str_detect(gene, pattern = "^HB[^(P)]") ~ "hb",
TRUE ~ "other"
)
) %>%
ggplot(aes(x = abundance, y = logFC, color = gene_type, label = gene_txt)) +
geom_point() +
ggrepel::geom_text_repel() +
geom_hline(yintercept = 1, color = "red", linetype = "dashed") +
geom_hline(yintercept = -1, color = "red", linetype = "dashed") +
annotate(
"segment",
x = 12, xend = 12,
y = 1, yend = 3,
color = "blue",
arrow = arrow(length = unit(0.2, "cm"))) +
annotate(
"text",
x = 13, y = 2,
label = "Overexpressed in poor quality cells",
vjust = -1,
hjust = 0.3,
color = "blue") +
annotate(
"segment",
x = 12, xend = 12,
y = -Inf, yend = -1,
color = "blue",
arrow = arrow(length = unit(0.2, "cm"), ends = "first")) +
annotate(
"text",
x = 12, y = -1.5,
label = "Overexpressed in good quality cells",
vjust = 1,
hjust = -0.1,
color = "blue") +
theme_classic() +
labs(
title = "Average Gene Expression vs. Log2 Fold Change",
x = "Average Gene Expression",
y = "Log2 Fold Change") Some genes are differentially expressed in the “poor quality” cells group. The top genes are platelet related which indicates there might be an enrichment of this cell type in that group. We would need to assess what the other genes are and if they are related to any celltype to make a decision on if its worth removing these cells or not.
Moreover, in this scenario it makes sense for ribosomal genes to be overexpressed in good quality cells since platelets have fewer ribosomal activity going on.
Resave Seurat object
saveRDS(se, '../data/Covid_Flu_Seurat_Object_Quality.rds')Session Info
sessionInfo()R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.42.4 limma_3.56.2 KernSmooth_2.23-22 fields_15.2 viridisLite_0.4.2 spam_2.10-0 ggridges_0.5.6 DoubletFinder_2.0.4 colorBlindness_0.1.9 RColorBrewer_1.1-3 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.4 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.0 tidyverse_2.0.0 Seurat_5.0.3 SeuratObject_5.0.2 sp_2.1-3
loaded via a namespace (and not attached):
[1] rstudioapi_0.15.0 jsonlite_1.8.8 magrittr_2.0.3 ggbeeswarm_0.7.2 spatstat.utils_3.0-4 farver_2.1.1 rmarkdown_2.26 vctrs_0.6.5 ROCR_1.0-11 spatstat.explore_3.2-6 htmltools_0.5.7 gridGraphics_0.5-1 sctransform_0.4.1 parallelly_1.37.0 htmlwidgets_1.6.4 ica_1.0-3 plyr_1.8.9 plotly_4.10.4 zoo_1.8-12 igraph_2.0.2 mime_0.12 lifecycle_1.0.4 pkgconfig_2.0.3 Matrix_1.6-5 R6_2.5.1 fastmap_1.1.1 MatrixGenerics_1.12.3 fitdistrplus_1.1-11 future_1.33.1 shiny_1.8.0 digest_0.6.34 colorspace_2.1-0 patchwork_1.2.0 tensor_1.5 RSpectra_0.16-1 irlba_2.3.5.1 labeling_0.4.3 progressr_0.14.0 fansi_1.0.6 spatstat.sparse_3.0-3 timechange_0.2.0 httr_1.4.7 polyclip_1.10-6 abind_1.4-5 compiler_4.3.1 withr_3.0.0 fastDummies_1.7.3
[48] maps_3.4.2 MASS_7.3-60 tools_4.3.1 vipor_0.4.5 lmtest_0.9-40 beeswarm_0.4.0 httpuv_1.6.14 future.apply_1.11.1 goftest_1.2-3 glue_1.7.0 nlme_3.1-163 promises_1.2.1 grid_4.3.1 Rtsne_0.17 cluster_2.1.4 reshape2_1.4.4 generics_0.1.3 gtable_0.3.4 spatstat.data_3.0-4 tzdb_0.4.0 data.table_1.15.0 hms_1.1.3 utf8_1.2.4 spatstat.geom_3.2-8 RcppAnnoy_0.0.22 ggrepel_0.9.5 RANN_2.6.1 pillar_1.9.0 RcppHNSW_0.6.0 later_1.3.2 splines_4.3.1 lattice_0.21-8 survival_3.5-7 deldir_2.0-2 tidyselect_1.2.0 locfit_1.5-9.8 miniUI_0.1.1.1 pbapply_1.7-2 knitr_1.45 gridExtra_2.3 scattermore_1.2 xfun_0.42 matrixStats_1.2.0 stringi_1.8.3 lazyeval_0.2.2 yaml_2.3.8 evaluate_0.23
[95] codetools_0.2-19 cli_3.6.2 uwot_0.1.16 xtable_1.8-4 reticulate_1.35.0.9000 munsell_0.5.0 Rcpp_1.0.12 globals_0.16.2 spatstat.random_3.2-2 png_0.1-8 ggrastr_1.0.2 parallel_4.3.1 ellipsis_0.3.2 dotCall64_1.1-1 sparseMatrixStats_1.12.2 listenv_0.9.1 scales_1.3.0 leiden_0.4.3.1 rlang_1.1.3 cowplot_1.1.3